A Structural Approach for Segmentation of Handwritten Hindi Text
Abstract
This paper attempts to segment handwritten Hindi words. The problem of segmentation is compounded by the possible presence of modifiers (matras) on all sides of the basic characters and by the uncertainty that different writing styles introduce into the character shapes. We have devised a structural approach to capture the similarities and differences between structure classes. The segmentation is performed in hierarchical order: 1) separating the upper modifiers and header line from the character, 2) detecting and then segmenting lower modifiers from the characters, 3) identifying whether a character is conjunct or not, 4) categorizing top modifiers based on Check_point, Mid_point and Touching_points. The segmentation accuracy has been found to be around 75%. We have applied general conditions for separating matras from the characters, but certain words cannot be segmented because they violate these general conditions. Such specific cases are not dealt with in this paper because handling them requires an exhaustive study on a large database, which is not presently available.

1.0 INTRODUCTION

The basic character set of the Devanagari script, which is used to write Hindi text, is very large, comprising 11 vowels and 33 consonants [3, 5]. Handwritten Hindi word segmentation has been a challenging problem because the characters of handwritten words do not have a fixed size or shape, so they are quite different from printed characters. In printed words, the vertical bar of a character occupies a single column, whereas in handwritten words it might occupy more than one column. For segmentation of printed words, statistical information is used in the literature [3, 5]. Statistical information is also used for classifying printed conjunct characters into consonants and half consonants [1]. The methods developed for printed words fail on handwritten words. We are therefore inclined to develop methods that can work on both printed and handwritten words.

Segmentation partitions handwritten Hindi text or words into individual characters. Since recognition relies heavily on isolated characters, segmentation is a critical step for character recognition: incorrect segmentation may lead to incorrect character recognition.

The organization of the paper is as follows. Section 2 describes the segmentation of Hindi words. Section 3 covers the identification and separation of conjunct characters. These two sections also provide some examples to illustrate the methodology. Section 4 gives definitions of certain terms required in the categorization of modifiers. Finally, conclusions are drawn in Section 5.

Proceedings of the International Conference on Cognition and Recognition, p. 589
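As a rough illustration of the first step in the hierarchy above (separating the header line and the upper zone from the characters), the following Python sketch uses a horizontal projection profile on a binary word image. This is not the authors' method, which the abstract does not specify. It assumes that the header line is the band of rows around the row with the most ink, and that characters below it are separated by empty columns. The names `find_header_rows`, `split_at_header`, `split_columns` and the parameter `min_fill` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def find_header_rows(word: Image, min_fill: float = 0.8) -> Tuple[int, int]:
    """Return (top, bottom) row indices of the header line, inclusive.

    Assumption: the header line is the run of rows around the row with the
    most ink whose ink count is at least min_fill times that peak count.
    """
    counts = [sum(row) for row in word]  # horizontal projection profile
    peak = max(counts)
    if peak == 0:
        raise ValueError("image contains no ink")
    peak_row = counts.index(peak)
    threshold = min_fill * peak
    top = peak_row
    while top > 0 and counts[top - 1] >= threshold:
        top -= 1
    bottom = peak_row
    while bottom < len(word) - 1 and counts[bottom + 1] >= threshold:
        bottom += 1
    return top, bottom


def split_at_header(word: Image) -> Tuple[Image, Image]:
    """Split the word into the upper zone and the body below the header."""
    top, bottom = find_header_rows(word)
    return word[:top], word[bottom + 1:]


def split_columns(body: Image) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column spans separated by empty columns."""
    if not body:
        return []
    width = len(body[0])
    has_ink = [any(row[c] for row in body) for c in range(width)]
    spans = []
    start = None
    for c, ink in enumerate(has_ink):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            spans.append((start, c - 1))
            start = None
    if start is not None:
        spans.append((start, width - 1))
    return spans


# Toy word: an upper modifier stroke, a two-row header line, two characters.
word = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
upper, body = split_at_header(word)
print(find_header_rows(word))  # (2, 3)
print(split_columns(body))     # [(1, 2), (5, 6)]
```

Allowing the header to span several rows reflects the point made in the introduction that handwritten strokes are thicker and less regular than printed ones. A slanted or broken header line, or characters that touch each other, would defeat this simple profile-based cut, which is presumably why a structural approach is needed.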
Similar resources
The Hazards in Segmentation of Handwritten Hindi Text
Optical Character Recognition (OCR) is a process for recognizing handwritten or printed scanned text with the help of a computer. Segmentation is a very important stage of any text recognition system. Problems in segmentation can lower the segmentation rate and hence the recognition rate. A good segmentation technique can improve the recognition rate. This paper deals with the hazards ...
Distinction between Machine Printed Text and Handwritten Text in a Document
In many documents, machine printed and handwritten texts are intermixed. Optical Character Recognition (OCR) techniques are different for machine printed and handwritten text, so it is necessary to separate these texts before giving input to the OCR. In this paper we propose a methodology for the Hindi language. This methodology is based on structural features of text. Experimental results on a data...
Language identification for handwritten document images using a shape codebook
Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale and rotation invariant codewords, each repres...
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting is to make searchable unindexed image documents by locating word/words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
A Holistic Approach for Handwritten Hindi Word Recognition
Holistic word recognition attempts to recognize the entire word image as a single pattern. In general, it performs better than a segmentation-based word recognition model for a known, fixed and small-sized lexicon. The present work deals with recognition of handwritten words in Hindi in a holistic way. Features like area, aspect ratio, density, pixel ratio, longest run, centroid and projection length...